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Abstract 

Computers and routers on the Internet send each other error mes- 
sages (called ICMP datagrams) to signal conditions such as network 
congestion or blackouts. While these datagrams are very rare, less than 
0.001% of total traffic, they hold very important global information about 
problems and congestions elsewhere in the Net. A measurement of the 
flow of such error messages in our local cluster shows a very pathological 
distribution of inter-message times: P(At) « 1/At. This scaling extends 
for about seven decades, and is only punctuated by extraneously periodic 
signals from automatons. More than a half of these error messages were 
themselves generated erroneously. 

The decentralized and anarchich structure of the Internet has, for a long 
time, invited people to perform all sorts of measurements. Indeed, it has taxed 
and even challenged our ability to make such measurements: for instance, it's 
been several years since we last knew, to any accuracy, how large the Inter- 
net actually is jl], Many types of measurements have shown "anomalous" 
or "pathologic" statistics, meaning that the distribution of some quantity has 
power-law (a.k.a. Pareto or "heavy" ) tails . A power-law tail implies that the 
probability of an outlier diminishes very slowly with the size of the outlier, 
and hence measurements show strong irregularities. While anomalous from a 
classical statistics viewpoint, this is a very usual phenomenon in the natural 
sciences, and has been extensively studied 0|; in particular, one of the earli- 
est examples concerns noise in communication channels pi. As applied to the 
Internet, local traffic measurements at the datagram level [?], ||[ p|, fio|| , session 
data jn|, webserver workloads fl| ||, USENET thread length and Web 
surfing patterns Jl5| all show these tails. There is ample evidence that these 
statistical pathologies do not come from intrinsic instabilities of the Net as a 
communications system, but rather stem from the way in which people use the 
Net. Power-law distributions are often regarded as prima-facie evidence of some 



self-organizational process, but they can also simply be a reflection of other 
underlying power laws. The sizes of all files on a computer follow a power-law 
distribution, from small files a few characters in size, to large datasets (e.g., 
videos) hundreds of megabytes in size. It is thus natural that, for instance, 
the sizes of documents retrieved from webservers reflect this underlying breadth 
p3j, and it's also been shown that individual surfers follow anomalous Levy- 
flight- like surfing patterns |l5[ ] . A useful and clarifying distinction was made in 
[ 111] : the initiations of "sessions" (such as telnet or ftp) follow perfectly normal 
Poissonian patterns, while the individual transactions within a session (the ft- 
pdata connections, or the TCP traffic within a given telnet session) fluctuate 
over many scales. 

However, all such measurements of local "traffic" have built-in limitations as 
to how much they can fluctuate. Indeed examination of the data shows a fairly 
complicated picture: the bulk of the traffic is not "self-similar" or pathological; 
only some tails show power-law scaling over a limited range. See Figure 1. Even 
a moderate amount of scaling behaviour can certainly create a lot of trouble 
for engineers (who have to design equipment to handle such contingencies); 
but it is not pervasive enough, nor is it defined over a large enough range, to 
qualify as "self-similar" behaviour in the sense in which it is usually used in the 
natural sciences. We could say that we are trying to make a measurement of 
precipitation during a storm: it will probably fluctuate and be gutsy, but in its 
bulk be statistically regular. 

We will do something different here: we will not focus on what we can mea- 
sure on a storm locally, but rather try to listen for echos of storms elsewhere. We 
will do so by focusing on a negligible portion of the local traffic: the ICMP error 
messages. These are packets that computers or routers send to each other on 
the internet to signal all sorts of traffic problems: "speak slower, you're breaking 
up" or "you can't get there from here" or "this bridge is backed up", etc. ICMP 
error messages are generated only when packets are "dropped", i.e., typically 
during congestions or blackouts, or simply to signal that a certain "place" does 
not exist and hence you can't get there. They are, thus, an indication of a 
deluge elsewhere, the locally received sound of faraway storms fl6|| . 

We measure ICMP error datagrams exchanged between our cluster and the 
outside of Rockefeller University. Error messages are rather rare, and hence 
it takes a very long time to accumulate a reasonable amount of events: our 
measurement accumulated 11118 datagrams in a span of 29 days. A histogram of 
packet inter-arrival times (Ati = ti+i — U) shows a distribution P(Ai) 1/Af; 
the scaling holds for about 7 decades, from 0.3 milliseconds to an hour: see 
Figure 2. It is only punctuated by what appear as "Dirac deltas": bursts with 
high periodicity embedded within the stream of error messages. These bursts 
typically correspond to "repeat transmissions" of the same error message. The 
distribution is much broader than traffic measurements, as shown in Table 1. 

The periodics peaks observed also show, on detailed inspection, some intrigu- 
ing features. The peaks are listed in Table 2. Several peaks are exceedingly thin; 
for instance, the 64 seconds peak has a half-width of 1 millisecond. However, 
most of the events in this peak were generated by a router immediately outside 
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Table 1: Distribution Widths. 
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lb 


R.U. 


External Traffic 


44.891 
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5.82 


lc 


BellCore 


External Traffic 


2.524 


22.372 


8.86 


Id 


R.U. 


ICMP Errors, ext 


239.511 


32000 


133.6 


2a 



We define the width of a distribution as the quotient between the 75% percentile 
and the 25% percentile: it thus indicates the range within which the middle 
half of the probability is contained. The value for a Poisson point process is 
log(l/4)/log(3/4) = 4.82, and the value for a Pareto distribution (G(x) = 1 - 
(f ) fc ) is 3 1 / 6 . The width is not translation invariant (as the standard deviation), 
but it is scale invariant; unlike the standard deviation, it is always defined. 
Times for the percentiles in milliseconds. 



Table 2: Peaks 
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22 


in 


14 


2 


host 


far routers 



the "walls" of Rockefeller, just three hops away, which explains the timing ac- 
curacy. Not all peaks are that easily explained. The 24 seconds peak also has 
a half-width of 1 millisecond; the 84 events within the half-width were gener- 
ated by 48 different routers and leaf nodes; most of them are in excess of 18 
hops away, with latencies about 500ms, and some of them are as far away as 
Germany, France and England; 19 of them were not on the DNS name tables. 

It is interesting to note that P(t) « l/t is a highly anomalous scaling, a 
limiting case of Pareto distributions. Not only does it lack standard deviation 
and mean, it is not even normalizable in the absence of both short time and 
long time cutoffs. In fact, it is the only power law which shows a symmetry 
under exchange of short and long times: both are equally divergent. Using the 
transformation v = l/t and P(t)dt = P[y)dv, we get P{v) = 1/u because dv = 
dt/t 2 , so our measurement shows a discrete point process version of l/f noise. 
Thus, the distribution of times between consecutive error messages displays a 
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behaviour which has no free parameters at all, except for the cutoffs. No features 
or details of the engineering underneath are left in this background, no traces 
of any of the many protocols. 

Recently, Huberman and Lukose p7| analysed the issue of global jamming 
on the internet. They were able to show that a simple game of cooperation 
and defection by Internet users would cause global jams, even when they used 
a trivial network transport model and a very simplified model of individual 
activity. The jams, though, have a nonpathological waiting time distribution, 
perhaps because the transport model used ignored the highly structured, tree- 
like connectivity of the Internet. This connectivity structure can be shown 
to provide a bias towards 1/t behaviour in the kind of measurement reported 
here. As seen from a given computer (a "leaf node"), the jumble of routers 
connecting to the rest of the Internet looks pretty much like a tree, though 
there are a few cycles due to the return paths from routers and multiple routes 
to a given destination. This tree changes with time, as the structure of routing 
changes; but at any given time, it still looks pretty much like a tree. Counting 
how many routers are exactly n hops away from us, we observe the number to 
roughly double each hop (2l|, so modeling the net as a binary tree, while a gross 
oversimplification, is nevertheless somewhat metrically correct; the real tree has 
hugely varying connectivity. If a link goes down for whatever reasons, it will 
take a while until nearby routers learn how to route around it. An attempt to 
access a host routed through this link will result in a "Destination unreachable" 
error message. Assuming a nonpathological distribution (links go down and 
up as uncorrelated processes with identical probabilities per link), attempts to 
randomly access the leaves generates a 1/t distribution of waiting times between 
error messages, see simulation output on Figure 3. 

This model is, however, not enough to explain our data. Though the basic 
power law is the same, the cutoffs strongly disagree. The extent of the scaling 
region is determined by the number of levels in the tree; in order to fit this 
extent to our data, the Internet should have several billion nodes. The lowest 
cutoff is given by the frequency at which the root node attempts to access 
leaves; in order to match our lower cutoff of < 1 ms, our domain would have 
to be attempting to access thousands of different computers per second, which 
is orders of magnitude wrong. The model should thus not be understood as an 
explanation of 1/t behaviour in our data, but rather as a basic structural bias 
towards power-law behaviour that the geometry imposes on the system. 

In a classic study, S. Bellovin pj| described a "natural computer virus": a 
DNS cache-corruption virus. Hosts on the Internet rely upon DNS name servers 
to get the address of a computer based on its name, and viceversa; DNS servers 
communicate with one another to get this information. In addition, DNS servers 
will "cache" the names they've obtained from other nameservers for a while. 
If this cache or a portion thereof is corrupted, then the DNS server will serve 
incorrect addresses, not just to hosts, but to other DNS servers, thus propagating 
the errors. It is highly likely that DNS is not the only Internet service that 
can support such self-propagating entities; there might indeed be a veritable 
ecosystem evolving on the backwaters of the net. Interestingly, evolutionary 
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Table 3: Breakout by Protocol 
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Redirect 
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Redirect 
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11 
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3 


Dest unreachable 


3 


2497 


3 


Dest unreachable 


1 


3300 


3 


Dest unreachable 


13 



Code [2C] 



Bad protocol 
in reassembly 
Forbidden host 
Needed to fragment 
Forbidden net 
Bad net 
network 
host 

Incorrect 1 

in transit 
Bad port 
Bad host 
Communication 
administratively forbidden 



RFC1812 IS ] stipulates that Source Quench messages should not be sent. 



pressure on such potential beings would be to stay out of sight, since their being 
noticed would result in fixing of the software bug that allows them to propagate 
in the first place. The generic feature enabling such a state of confusion to 
propagate is the fact that a master server cannot determine that it is confused, 
and can't reply "I am or might be confused, ask someone else". Thus the server 
keeps speweing confusion. Quite a few services have such structure. 

As applied to our case, we have been measuring errors, and most of them are 
generated by routers, and routers pass to one another information that is deemed 
to be authoritative through RIP. So it would be interesting to assess whether 
the ICMP errors are correctly generated or are, themselves, erroneous. Table 3 
shows a breakout by protocol of our data set. A fair fraction of the errors seen 
there should not have been sent at all; source quench datagrams, for instance, 
have been deprecated for a long time. The redirect messages should not have 
made it out of their respective local net, and the redirect host class consists 
exclusively of messages like "to reach host x, use x as a gateway", which are 
obviously nonsensical [E3|. Time exceeded in transit are typically generating 
through erroneous routing loops |2^] . But the most frequent class of errors, 
Destination unreachable, poses an interesting problem: since these problems 
are typically transient, how can one evaluate whether the datagrams were sent 
correctly or in error? A way out is to notice that the most frequent subclass 
of errors is communication administratively forbidden. This should mean the 
host is in a protected subnet, typically behind a firewall, a condition which is 
definitely not transient. We generated a list of all the hosts to which an 
access attempt had resulted in this error in our dataset, and attempted access 
all over again. We got through without errors to 152 hosts out of 320 (« 50%), 
which means that these hosts were not placed behind administrative restrictions. 
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These hosts were responsible for 2297 error datagrams out of the 3300 (« 70%). 

Thus we can say that, in all likelihood, more than a half of our data 
set consists of error messages which were themselves generated erro- 
neously, through incorrect routing. Obviously, router confusion dominates over 
actual physical errors. These states of confusion could last substantially longer 
than actual link or host downtimes, and perhaps even spread or self-organize in 
some fashion. This may be an underlying cause of power-law organization in 
the data. Let us also recall that a fundamental law of error correction circuitry 
design is that the error correction components are just as fallible as the rest of 
the circuit. Parity correction algorithms have to assume the error might very 
well be in the parity bit, for instance. This fundamental notion has not been 
implemented in communication protocols like RIP or ICMP: there is no ICMP 
message reading "I'm confused". 

As use of the Internet continues to spread, and as research and educational 
institutions prepare for the Next Generation Internet and Internet2 projects, it 
becomes more important to understand the global, large scale dynamics of our 
world-wide network. We've shown that important clues to this dynamics may 
lie in rather insignificant fractions of the overall traffic of the Net. 

I would like to thank G. Cecchi, D. Chialvo, J. -P. Eckmann, M. Feigenbaum, 
A. Libchaber and E. Siggia. This work was supported in part by the Sloan 
Foundation. 
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Figure Captions 

Figure 1: Cumulative probability distributions G(At) of packet interarrival 
times for four different traffic measurements: (a) 150000 packets of local traffic 
at our cluster, (b) one million packets of local traffic at Bellcore (c) 150000 
packets of external traffic at out cluster, (d) one million packets of external 
traffic at Bellcore. Traces (a) and (c) were measured locally for this study; 
traces (b) and (d) were studied in || 0, |[ || and are publicly available |1S|| . 

Figure 2: (a) Cumulative probability distribution G(Ai) and probability 
density function -P(Ai) for the interarrival times of ICMP error datagrams be- 
tween our cluster and the outside of the Rockefeller campus. Notice that the 
probability density follows -P(Ai) « 1/t for about 7 decades, except for sharp 
Dirac-5-like peaks, (b) Probability density for a two month trace over two sub- 
nets. 

Figure 3: Distribution of inter-event times in a simulation of network access 
errors due to down links on a binary tree. The tree had 29 levels, so there were 
2 29 leaf nodes and 2 29 — 2 routers and links. At any given time 100 links are 
down; they stay down for an average of 10 5 iterations. 
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